Detector and method for voice activity detection

ABSTRACT

The embodiments of the present invention relate to a voice activity detector (VAD) and a method thereof. The voice activity detector is configured to detect voice activity in a received input signal and comprises an input section configured to receive a signal from a primary voice detector of said VAD indicative of a primary VAD decision and at least one signal from at least one external VAD indicative of a voice activity decision from the at least one external VAD, a processor configured to combine the voice activity decisions indicated in the received signals to generate a modified primary VAD decision, and an output section configured to send the modified primary VAD decision to a hangover addition unit of said VAD.

RELATED APPLICATIONS

This application is filed under 35 U.S.C. §371 as a National Stage Application of International Patent App. No. PCT/SE2010/051118 filed Oct. 18, 2010, and claims priority to U.S. 61/376,815 filed Aug. 25, 2010, U.S. 61/252,858 filed Oct. 19, 2009, U.S. 61/252,966 filed Oct. 19, 2009, and U.S. 61/262,583 filed Nov. 19, 2009, each of which is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The present invention relates to a method and a voice activity detector, and in particular to an improved voice activity detector for handling e.g. non-stationary background noise.

BACKGROUND

In speech coding systems used for conversational speech it is common to use discontinuous transmission (DTX) to increase the efficiency of the encoding. The reason is that conversational speech contains large amounts of pauses embedded in the speech, e.g. while one person is talking the other one is listening. So with DTX the speech encoder is only active about 50 percent of the time on average and the rest can be encoded using comfort noise. One example of a codec that has this feature is AMR NB (Adaptive Multi-Rate Narrowband).

For high quality DTX operation, i.e. without degraded speech quality, it is important to detect the periods of speech in the input signal. This is done by the Voice Activity Detector (VAD). FIG. 1 shows an overview block diagram of a generalized VAD 180, which takes the input signal 100, divided into data frames of 5-30 ms depending on the implementation, as input and produces VAD decisions as output 160. That is, a VAD decision 160 is a decision for each frame whether the frame contains speech or noise.

The generic VAD 180 comprises a background estimator 130 which provides subband energy estimates and a feature extractor 120 providing the feature subband energy. For each frame, the generic VAD calculates features, and to identify active frames the feature(s) for the current frame are compared with an estimate of how the feature “looks” for the background signal.

The primary decision, “vad_prim” 150, is made by a primary voice activity detector 140 and is basically just a comparison of the features for the current frame and the background features (estimated from previous input frames), where a difference larger than a threshold causes an active primary decision. The hangover addition block 170 is used to extend the VAD decision from the primary VAD based on past primary decisions to form the final VAD decision, “vad_flag” 160, i.e. older VAD decisions are also taken into account. The reason for using hangover is mainly to reduce/remove the risk of mid speech and backend clipping of speech bursts. However, the hangover can also be used to avoid clipping in music passages. An operation controller 110 may adjust the threshold(s) for the primary detector and the length of the hangover addition according to the characteristics of the input signal.
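The hangover addition described above can be illustrated with a minimal sketch. It assumes a fixed-length hangover implemented as a frame counter; the type hangover_state, the function hangover_add and the field names are hypothetical and are not taken from any of the codecs mentioned. Real implementations may adapt the hangover length to the input signal.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical hangover state; names are illustrative only. */
typedef struct {
    int hangover_frames; /* frames to stay active after the last active primary decision */
    int remaining;       /* hangover frames still left */
} hangover_state;

/* Returns the final decision (vad_flag, 0 or 1) for one frame,
 * given the primary decision (vad_prim, 0 or 1) for that frame. */
static int hangover_add(hangover_state *st, int vad_prim)
{
    assert(st != NULL && st->hangover_frames >= 0);
    if (vad_prim) {
        st->remaining = st->hangover_frames; /* re-arm on every active frame */
        return 1;
    }
    if (st->remaining > 0) {
        st->remaining--;                     /* extend activity into the pause */
        return 1;
    }
    return 0;
}
```

With hangover_frames set to 2, a single active primary decision keeps the final decision active for the two following inactive frames, which is how backend clipping of a speech burst is reduced in this sketch.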

There are a number of different features that can be used for VAD detection. One feature is to look just at the frame energy and compare this with a threshold to decide if the frame comprises speech or not. This scheme works reasonably well for conditions where the SNR is good but not for low SNR cases. In low SNR it is instead required to use other metrics comparing the characteristics of the speech and noise signals. For real-time implementations an additional requirement of VAD functionality is low computational complexity, and this is reflected in the frequent representation of subband SNR VADs in standard codecs, e.g. AMR NB, AMR WB (Adaptive Multi-Rate WideBand) and G.718 (ITU-T recommendation embedded scalable speech and audio codec).

The subband SNR based VAD combines the SNRs of the different subbands into a metric which is compared to a threshold for the primary decision. In the subband based VAD, the SNR is determined for each subband and a combined SNR is determined based on those SNRs. The combined SNR may be a sum of all SNRs on different subbands. There are also known solutions where multiple features with different characteristics are used for the primary decision. However, in both cases there is just one primary decision that is used for adding hangover, which may be adaptive to the input signal conditions, to form the final decision. Also, many VADs have an input energy threshold for silence detection, i.e. for input levels that are low enough, the primary decision is forced to the inactive state.
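As a rough illustration of a subband SNR based primary decision with silence detection, the following sketch sums per-subband SNRs and compares the sum with a threshold. It assumes linear energy ratios as the per-subband SNR and the total frame energy as the input level; the function name primary_decision and all parameter names are hypothetical, and actual codecs use their own SNR definitions and threshold adaptation.

```c
#include <assert.h>
#include <stddef.h>

/* Returns 1 (active) or 0 (inactive) for one frame.
 * frame_energy[i] and bg_energy[i] are the subband energy of the current
 * frame and the background estimate for subband i (assumed inputs). */
static int primary_decision(const double *frame_energy,
                            const double *bg_energy,
                            size_t n_bands,
                            double thr,
                            double silence_thr)
{
    double snr_sum = 0.0; /* combined SNR: sum over all subbands */
    double total = 0.0;   /* total input energy, used for silence detection */

    assert(frame_energy != NULL && bg_energy != NULL);
    for (size_t i = 0; i < n_bands; i++) {
        assert(bg_energy[i] > 0.0);
        snr_sum += frame_energy[i] / bg_energy[i];
        total += frame_energy[i];
    }
    if (total < silence_thr) {
        return 0; /* low input level: decision forced to inactive */
    }
    return snr_sum > thr;
}
```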

For VADs based on the subband SNR principle it has been shown that the introduction of a non-linearity in the subband SNR calculation, called significance thresholds, can improve VAD performance for conditions with non-stationary noise (babble, office). Non-stationary noise can be difficult for all VADs, especially under low SNR conditions, which results in a higher VAD activity compared to the actual speech and reduced capacity from a system perspective. Of the non-stationary noises the most difficult is babble noise, and the reason is that its characteristics are relatively close to the speech signal the VAD is designed to detect. Babble noise is usually characterized both by the SNR relative to the speech level of the foreground speaker and the number of background talkers, where a common definition (as used in subjective evaluations) is that babble should have 40 or more background speakers, the basic motivation being that for babble it should not be possible to follow any of the included speakers in the babble noise (none of the babble speakers shall become intelligible). It should also be noted that with an increasing number of talkers in the babble noise it becomes more stationary. With only one (or a few) speaker(s) in the background they are usually called interfering talker(s). A further problematic issue is that babble noise may have spectral variation characteristics very similar to some music pieces that the VAD algorithm shall not suppress.

In the previously mentioned VAD solutions AMR NB/WB and G.718 there are varying degrees of problems with babble noise, in some cases already at reasonable SNRs (20 dB). The result is that the assumed capacity gain from using DTX cannot be realized. In real mobile phone systems it has also been noted that it may not be enough to require reasonable DTX operation in 15-20 dB SNR. If possible one would desire reasonable DTX operation down to 5 dB or even 0 dB depending on the noise type. For low frequency background noise an SNR gain of 10-15 dB can be achieved for the VAD functionality just by highpass filtering the signal before VAD analysis. Due to the similarity of babble to speech the gain from highpass filtering the input signal is very low.

From a quality point of view it is better to use a failsafe VAD, meaning that when in doubt it is better for the VAD to signal speech input and just allow for a large amount of extra activity. This may, from a system capacity point of view, be acceptable as long as only a few of the users are in situations with non-stationary background noise. However, with an increasing number of users in non-stationary environments the usage of a failsafe VAD may cause significant loss of system capacity. It is therefore becoming important to work on pushing the boundary between failsafe and normal VAD operation so that a larger class of non-stationary environments is handled using normal VAD operation.

Although the use of significance thresholds improves VAD performance, it has been noted that it may also cause occasional speech clipping, mainly front end clipping of low SNR unvoiced sounds.

For existing solutions, when a new problem area is identified it can be difficult to find a new tuning of an existing VAD that does not change the behavior of the VAD for already working conditions. That is, while it would be possible to change the tuning to cope with the new problem, it may not be possible to make the tuning without changing the behavior in already known conditions.

SUMMARY

The embodiments of the present invention provide a solution for retuning existing VADs to handle non-stationary backgrounds or other discovered problem areas.

Thus, by allowing multiple VADs to work in parallel and then combining the outputs, it is possible to exploit the strengths of the different VADs without suffering too much from each VAD's limitations.

In one embodiment, to be used in situations when one wants to reduce excessive activity, the primary decision of the first VAD is combined with a final decision from an external VAD by a logical AND. The external VAD is preferably more aggressive than the first VAD. An aggressive VAD implies a VAD which is tuned/constructed to generate lower activity compared to a “normal” VAD. The main purpose of an aggressive VAD is that it should reduce the amount of excessive activity compared to a normal/original VAD. Note that this aggressiveness may apply only to some particular (or a limited number of) condition(s), e.g. concerning noise types or SNRs.

Another embodiment can be used in situations when one wants to add activity without causing excessive activity. In this embodiment, the primary decision of the first VAD may be combined with a primary decision from an external VAD by a logical OR.

Thus, according to a first aspect of embodiments of the present invention, a method in a voice activity detector (VAD) for detecting voice activity in a received input signal is provided. In the method, a signal is received from a primary voice detector of said VAD indicative of a primary VAD decision and at least one signal is received from at least one external VAD indicative of a voice activity decision from the at least one external VAD. The voice activity decisions indicated in the received signals are combined to generate a modified primary VAD decision, and the modified primary VAD decision is sent to a hangover addition unit of said VAD.

According to a second aspect of embodiments of the present invention, a voice activity detector (VAD) is provided. The VAD is configured to detect voice activity in a received input signal and comprises an input section configured to receive a signal from a primary voice detector of said VAD indicative of a primary VAD decision and at least one signal from at least one external VAD indicative of a voice activity decision from the at least one external VAD. The VAD further comprises a processor configured to combine the voice activity decisions indicated in the received signals to generate a modified primary VAD decision and an output section configured to send the modified primary VAD decision to a hangover addition unit of said VAD.

By combining an existing VAD with one or more external VADs it is possible to improve overall VAD performance with only minor effect on internal states of the original VAD, which may be a requirement for other codec functions, e.g. frame classification and codec mode selection.

A further advantage with embodiments of the present invention is that the use of multiple VADs does not affect normal operation, i.e. when the SNR of the input signal is good. It is only when the normal VAD function is not good enough that the external VAD should make it possible to extend the working range of the VAD.

If the external VAD works properly for the noise causing problems, the solution of an embodiment allows the external VAD to override the primary decision from the first VAD, i.e. preventing false activity on background noise only.

Further, addition of more external VADs makes it possible to reduce the amount of excessive activity or allow detection of additional previously clipped speech (or audio). Adaptation of the combination logic to the current input conditions may be needed to prevent the external VADs from increasing the excessive activity or introducing additional speech clipping. The adaptation of the combination logic could be such that the external VADs are only used during input conditions (noise level, SNR, or noise characteristics [stationary/non-stationary]) where it has been identified that the normal VAD is not working properly.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a generic VAD with background estimation according to prior art.

FIGS. 2-5 show a generic VAD with background estimation including the multi VAD combination logic according to embodiments of the present invention.

FIG. 6 discloses a combination logic according to embodiments of the present invention.

FIG. 7 is a flowchart of a method according to embodiments of the present invention.

DETAILED DESCRIPTION

The embodiments of the present invention will be described more fully hereinafter with reference to the accompanying drawings, in which preferred embodiments of the invention are shown. The embodiments may, however, be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art. In the drawings, like reference signs refer to like elements.

Moreover, those skilled in the art will appreciate that the means and functions explained herein below may be implemented using software functioning in conjunction with a programmed microprocessor or general purpose computer, and/or using an application specific integrated circuit (ASIC). It will also be appreciated that while the current embodiments are primarily described in the form of methods and devices, the embodiments may also be embodied in a computer program product as well as a system comprising a computer processor and a memory coupled to the processor, wherein the memory is encoded with one or more programs that may perform the functions disclosed herein.

FIG. 2 shows a first VAD 199 with background estimation as in FIG. 1. A difference is that the VAD further comprises a combination logic 145 according to a first embodiment of the present invention. In this embodiment, the performance of the first VAD is improved with the introduction of an external vad_flag_he 190 from an external VAD 198 to the combination logic 145, which is introduced before the hangover addition 170. It should be noted that the way the external VAD 198 is used will not affect the primary voice activity detector 140 and the normal behaviour of the VAD during good SNR conditions. By forming the new primary decision, referred to as vad_prim′ 155, in the combination logic 145 through a logical AND between the primary decision vad_prim from the first VAD and the final decision, referred to as vad_flag_he 190, from the external VAD 198, excessive activity of the VAD can be avoided. The first embodiment is also shown in FIG. 3, which also schematically illustrates the external VAD (VAD 2). FIG. 3 is further explained below.

With the external VAD according to the embodiments described above, it is possible to reduce the excessive activity for additional noise types. This is achieved as the external VAD can prevent false active signals from the original VAD. Excessive activity implies that the VAD indicates active speech for frames which only comprise background noise. This excessive activity is usually a result of 1) non-stationary speech-like noise (babble) or 2) that the background noise estimation is not working properly due to non-stationary noise or other falsely detected speech-like input signals.

According to a second embodiment, the combination logic forms a new primary decision referred to as vad_prim′ through a logical OR between the primary decision vad_prim from the first VAD and the primary decision referred to as vad_prim_he from the external VAD. In this way it is possible to add activity to correct undesired clipping performed by the first VAD.

The second embodiment is illustrated in FIG. 4, which also shows the external VAD 198. The combination logic 145 forms a primary decision referred to as vad_prim′ 155 through a logical OR between the primary decision vad_prim 150 of the primary VAD 140 of the first VAD 199 and the primary decision referred to as vad_prim_he 190 from the external VAD 198. As a result, the external VAD 198 can be used to avoid clipping caused by the first VAD 199. Hence, the external VAD 198 is able to correct errors caused by the first VAD 199, which implies that activity missed by the first VAD 199 can be detected by the external VAD 198. In order to avoid increasing excessive activity it is an advantage to use the primary decision of the external VAD.
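The OR combination of the second embodiment can be sketched over a short sequence of frames. This is only an illustration under the assumption that both primary decisions are available as arrays of 0/1 flags; the function name combine_or and the array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Second-embodiment sketch: vad_prim' = vad_prim OR vad_prim_he, per frame.
 * All flags are 0 (inactive) or 1 (active). */
static void combine_or(const int *vad_prim, const int *vad_prim_he,
                       int *vad_prim_mod, size_t n_frames)
{
    assert(vad_prim != NULL && vad_prim_he != NULL && vad_prim_mod != NULL);
    for (size_t i = 0; i < n_frames; i++) {
        /* a frame missed by the first VAD is recovered if the
         * external primary decision is active for that frame */
        vad_prim_mod[i] = vad_prim[i] || vad_prim_he[i];
    }
}
```

Because the external primary decision carries no hangover of its own, the only activity added in this sketch is on frames where the external primary detector itself is active.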

Turning now to FIG. 5, which corresponds to FIG. 2, a third embodiment is shown. In the third embodiment, the combination logic 145 forms a new primary decision referred to as vad_prim′ 155 through a combination of the primary decision vad_prim 150 from the primary voice activity detector 140 of the first VAD and the final decision 190a and the primary decision 190b from the external VAD. These three decisions may be combined by using any combination of AND and/or OR in the combination logic 145. As one example, the primary decisions of the first and the external VADs may be combined with a logical OR before the result is combined with the final decision of the external VAD by using a logical AND. It would then be possible to also detect previously clipped segments.
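The example combination given for the third embodiment, an OR of the two primary decisions followed by an AND with the final decision of the external VAD, might be written as follows. The function name combine_third is hypothetical; the flag names follow those used in the text.

```c
#include <assert.h>

/* Third-embodiment sketch. Flags are 0 (inactive) or 1 (active). */
static int combine_third(int vad_prim, int vad_prim_he, int vad_flag_he)
{
    assert((vad_prim == 0 || vad_prim == 1) &&
           (vad_prim_he == 0 || vad_prim_he == 1) &&
           (vad_flag_he == 0 || vad_flag_he == 1));
    /* OR of the primary decisions first, then AND with the external final decision */
    return (vad_prim || vad_prim_he) && vad_flag_he;
}
```

In this sketch a frame missed by the first VAD (vad_prim = 0) still becomes active when the external VAD detects it, while the external final decision can still veto activity from the first VAD.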

According to a fourth embodiment, VAD decisions from more than one external VAD are used by the combination logic to form the new Vad_prim′. The VAD decisions may be primary and/or final VAD decisions. If more than one external VAD is used, these external VADs can be combined prior to the combination with the first VAD, e.g. Vad_prim & (external_vad_1 & external_vad_2).
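A sketch of the fourth embodiment with an arbitrary number of external VADs is given below. It assumes that a logical AND is used both between the external decisions and with the first VAD; the function name combine_multi_and and its parameters are hypothetical, and other combinations of AND and OR are equally possible.

```c
#include <assert.h>
#include <stddef.h>

/* Fourth-embodiment sketch: the external decisions are combined first,
 * then the result is combined with vad_prim of the first VAD.
 * Flags are 0 (inactive) or 1 (active). */
static int combine_multi_and(int vad_prim, const int *external, size_t n_ext)
{
    int ext_all = 1; /* AND over all external decisions */

    assert(external != NULL || n_ext == 0);
    for (size_t i = 0; i < n_ext; i++) {
        ext_all = ext_all && external[i];
    }
    return vad_prim && ext_all;
}
```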

In this specification the primary decision of the VAD implies the decision made by the primary voice activity detector. This decision is referred to as Vad_prim or localVAD. The final decision of the VAD implies the decision made by the VAD after the hangover addition. The combination logic according to embodiments of the present invention is introduced in a VAD and generates a Vad_prim′ based on the Vad_prim of the VAD and an external VAD decision from an external VAD. The external VAD decision can be a primary decision and/or a final decision of one or more external VADs. The combination logic is configured to generate the Vad_prim′ by applying a logical AND or logical OR on the Vad_prim of the first VAD and the VAD decision or VAD decisions from the external VAD(s).

FIGS. 3 and 4 are block diagrams of the first VAD and the external VAD. The block diagrams show the two VADs, consisting of the original VAD (VAD 1) and the external VAD (VAD 2), with combination logic for generation of the improved vad_prim in the original VAD according to embodiments.

As indicated in FIGS. 3 and 4, the two VADs share the feature extractor. The external VAD may use a modified background update and a modified primary voice activity detector. The modified background update comprises a modification in the background noise update strategy wherein the normal noise update deadlock recovery is slowed down and an alternative possibility for noise updates is added to allow the noise estimate to better track the noise. The modified primary voice activity detector may add significance thresholds and an updated threshold adaptation based on energy variations of the input. These two modifications may be used in parallel.

To make a primary decision for the first VAD, referred to as VAD 1, a variable SNR sum, snr_sum, is compared with a calculated threshold, thr1, in order to determine whether the input signal is active speech (localVAD=1, which corresponds to Vad_prim=1) or noise (localVAD=0, which corresponds to Vad_prim=0), as in prior art and as indicated below:

localVAD = 0;
if ( snr_sum > thr1 )
{
    localVAD = 1;
}

Using the combination logic according to embodiments of the present invention, a logical AND is applied on the localVAD from the first VAD and the final decision from the external VAD, referred to as vad_flag_he. That is, with the use of the combination logic the primary voice activity detector is only allowed to become active if both the localVAD from the first VAD and vad_flag_he from the external VAD are active. I.e.,

localVAD = 0;
if ( snr_sum > thr1 && vad_flag_he )
{
    localVAD = 1;
}

The modification is the added condition on vad_flag_he. As the value of vad_flag_he is needed, the code for the external VAD, including its hangover addition, needs to be executed before one can generate the modified VAD 1 decision.
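The required execution order can be sketched for a single frame as follows. This is a simplified illustration only: it assumes that both VADs use a subband SNR sum compared with a threshold and a fixed-length counter hangover; the type hangover, the functions hangover_step and dual_vad_frame, and the names snr_sum_he and thr_he are hypothetical and not taken from the codec source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical fixed-length hangover counter. */
typedef struct {
    int length;    /* hangover length in frames */
    int remaining; /* hangover frames still left */
} hangover;

static int hangover_step(hangover *h, int prim)
{
    assert(h != NULL && h->length >= 0);
    if (prim) {
        h->remaining = h->length;
        return 1;
    }
    if (h->remaining > 0) {
        h->remaining--;
        return 1;
    }
    return 0;
}

/* One frame of the first embodiment; returns vad_flag of the first VAD. */
static int dual_vad_frame(double snr_sum, double thr1,
                          double snr_sum_he, double thr_he,
                          hangover *ho1, hangover *ho_he)
{
    /* 1. The external VAD, including its hangover addition, runs first. */
    int vad_flag_he = hangover_step(ho_he, snr_sum_he > thr_he);

    /* 2. Modified primary decision of VAD 1. */
    int localVAD = 0;
    if (snr_sum > thr1 && vad_flag_he) {
        localVAD = 1;
    }

    /* 3. Hangover addition of VAD 1 operates on the modified decision. */
    return hangover_step(ho1, localVAD);
}
```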

In a fifth embodiment, the combination logic is configured to be signal adaptive, i.e. changing the combination logic depending on the current input signal properties. The combination logic could depend on the estimated SNR, e.g. it would be possible to use an even more aggressive second VAD if the combination logic is configured such that only the original VAD is used in good conditions, while for noisy conditions the aggressive VAD is used as in embodiment 1. With this adaptation the aggressive VAD could not introduce speech clippings in good SNR conditions, while in noisy conditions it is assumed that the clipped speech frames are masked by the noise.
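One possible form of such signal adaptive combination logic is sketched below. It assumes that an SNR estimate in dB is available and that a single limit separates good from noisy conditions; the function name combine_adaptive and the parameters snr_db and snr_limit_db are hypothetical, and the text does not specify any particular limit.

```c
#include <assert.h>

/* Fifth-embodiment sketch. Flags are 0 (inactive) or 1 (active). */
static int combine_adaptive(int vad_prim, int vad_flag_he,
                            double snr_db, double snr_limit_db)
{
    assert((vad_prim == 0 || vad_prim == 1) &&
           (vad_flag_he == 0 || vad_flag_he == 1));
    if (snr_db >= snr_limit_db) {
        return vad_prim;            /* good conditions: original VAD only */
    }
    return vad_prim && vad_flag_he; /* noisy conditions: AND as in embodiment 1 */
}
```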

One purpose of some embodiments of the present invention is to reduce the excessive activity for non-stationary background noises. This can be measured using objective measures by comparing the activity of encoded mixtures. However, this metric does not indicate when the reduction in activity starts affecting the speech, i.e. when speech frames are replaced with background noise. It should be noted that in speech with background noise not all speech frames will be audible. In some cases speech frames may actually be replaced with noise without introducing an audible degradation. For this reason it is also important to use subjective evaluation of some of the modified segments.

The objective results presented below are based on mixtures of speech with background noises under varying conditions, with respect to different speech samples in several languages for different noise environments and signal to noise ratios (SNRs).

Mixtures were created with different noise samples and with different SNR conditions. The noises were categorized as Exhibition noise, Office noise, and Lobby noise as representations for non-stationary background noises. Speech and noise files were mixed, with the speech level set to −26 dBov and four different SNRs in the range 10-30 dB.

The prepared samples were then processed both by using the codec with the original VAD according to prior art and with the codec using the combined VAD solution (denoted Dual VAD) according to embodiments of the present invention.

For the objective results the speech activity generated by the different codecs using the different VAD solutions is compared, and the results can be found in the table below. Note that the activity figures in the table are measured for the complete samples, which are 120 seconds each. A tool used for level adjustments of the speech clips indicated that the speech activity of the clean speech files was estimated to 21.9%.

Table: Summary of activity results: total, noise types, and SNRs

Condition                   Original   Dual VAD   Activity reduction
All noises/SNRs             50.5       34.0       16.5
Exhibition noise, all SNR   50.4       35.7       14.7
Office noise, all SNR       67.1       41.7       25.4
Lobby noise, all SNR        33.9       24.4        9.5
30 dB SNR                   29.3       23.4        5.9
20 dB SNR                   43.6       29.1       14.5
15 dB SNR                   58.5       37.3       21.2
10 dB SNR                   70.6       46.0       24.6

The results show that one embodiment of the present invention, shown in FIG. 3, provides a reduction in activity.
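For orientation, an activity figure of the kind compared above can be computed from a sequence of per-frame final decisions as the share of active frames. The sketch below assumes that activity is expressed as a percentage of all frames in the sample; the function name activity_percent is hypothetical and the measurement tool actually used is not described in the text.

```c
#include <assert.h>
#include <stddef.h>

/* Share of frames with an active final decision (vad_flag = 1), in percent. */
static double activity_percent(const int *vad_flag, size_t n_frames)
{
    size_t active = 0;

    assert(vad_flag != NULL && n_frames > 0);
    for (size_t i = 0; i < n_frames; i++) {
        if (vad_flag[i]) {
            active++;
        }
    }
    return 100.0 * (double)active / (double)n_frames;
}
```

The activity reduction is then the difference between the figure obtained with the original VAD and the figure obtained with the combined VAD for the same sample.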

According to one aspect of embodiments, a method in a combination logic of a VAD is provided as illustrated in the flowchart of FIG. 7. The VAD is configured to detect voice activity in a received input signal. A signal from a primary voice detector of said VAD indicative of a primary VAD decision and at least one signal from at least one external VAD indicative of a voice activity decision from the at least one external VAD are received 1101. The voice activity decisions indicated in the received signals are combined 1102 to generate a modified primary VAD decision. The modified primary VAD decision is sent 1103 to a hangover addition unit of said VAD to be used for making the final VAD decision.

The voice activity decisions in the received signals may be combined by a logical AND such that the modified primary VAD decision of said VAD indicates voice only if both the signal from the primary VAD and the signal from the at least one external VAD indicate voice.

Moreover, the voice activity decisions in the received signals may also be combined by a logical OR such that the modified primary VAD decision of said VAD indicates voice if at least one signal of the signal from the primary VAD and the signal from the at least one external VAD indicates voice.

The at least one signal from the at least one external VAD may indicate a voice activity decision from the external VAD which is a final and/or primary VAD decision.

According to another aspect of embodiments, a VAD configured to detect voice activity in a received input signal is provided as illustrated in FIG. 6. The VAD comprises an input section 502 for receiving a signal 150 from a primary voice detector of said VAD indicative of a primary VAD decision and at least one signal 190 from at least one external VAD indicative of a voice activity decision from the at least one external VAD. The VAD further comprises a processor 503 for combining the voice activity decisions indicated in the received signals to generate a modified primary VAD decision, and an output section 505 for sending the modified primary VAD decision 155 to a hangover addition unit of said VAD. The VAD may further comprise a memory 504 for storing history information and software code portions for performing the method of the embodiments. It should also be noted, as exemplified above, that the input section 502, the processor 503, the memory 504 and the output section 505 may be embodied in a combination logic 145 in the VAD.

According to an embodiment, the processor 503 is configured to combine voice activity decisions in the received signals by a logical AND such that the modified primary VAD decision of said VAD indicates voice only if both the signal from the primary VAD and the signal from the at least one external VAD indicate voice.

According to a further embodiment, the processor 503 is configured to combine voice activity decisions in the received signals by a logical OR such that the modified primary VAD decision of said VAD indicates voice if at least one signal of the signal from the primary VAD and the signal from the at least one external VAD indicates voice.

Modifications and other embodiments of the disclosed invention will come to mind to one skilled in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings. Therefore, it is to be understood that the embodiments of the invention are not to be limited to the specific embodiments disclosed and that modifications and other embodiments are intended to be included within the scope of this disclosure. Although specific terms may be employed herein, they are used in a generic and descriptive sense only and not for purposes of limitation.

The invention claimed is:
1. A method in a first voice activity detector, VAD, for detecting voice activity in a received input signal, the method comprising: receiving a signal from a primary voice detector of said first VAD indicative of a primary voice activity decision made by the primary voice detector regarding voice activity in said input signal, wherein the primary voice activity decision is an intermediate voice activity decision of said first VAD in the sense that the primary voice activity decision is made by the first VAD without having been processed by a hangover addition unit of said first VAD, receiving one or more signals from one or more second VADs external to the first VAD each indicative of a voice activity decision made by a respective second VAD regarding voice activity in said input signal, each second VAD comprising its own primary voice detector and hangover addition unit distinct from that of said first VAD, combining the voice activity decisions indicated in the signal received from the primary voice detector of said first VAD and the one or more signals received from the one or more second VADs to generate a modified primary voice activity decision, and sending the modified primary voice activity decision to a hangover addition unit of said first VAD that is configured to make a final voice activity decision of said first VAD.
2. The method according to claim 1, wherein the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs are combined by a logical AND, the modified primary voice activity decision thereby indicating voice only if the signal from the primary voice detector and each signal from the one or more second VADs indicate voice.
3. The method according to claim 1, wherein the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs are combined by a logical OR, the modified primary voice activity decision thereby indicating voice if at least one signal of the signal from the primary voice detector and the one or more signals from the one or more second VADs indicate voice.
4. The method according to claim 1, wherein at least one signal from a second VAD is a final voice activity decision made by that second VAD in the sense that the final voice activity decision is made by the second VAD after having been processed by the hangover addition unit of said second VAD.
5. The method according to claim 1, wherein at least one signal from a second VAD is a primary voice activity decision made by a primary voice detector of that second VAD, the primary voice activity decision being an intermediate voice activity decision of the second VAD in the sense that the primary voice activity decision is made by the second VAD without having been processed by the hangover addition unit of said second VAD.
6. The method according to claim 1, comprising receiving only one signal from one of said second VADs.
7. The method according to claim 1, comprising receiving a plurality of signals from a plurality of said second VADs.
8. The method according to claim 1, wherein the voice activity decisions indicated in the signals received from the primary voice detector and the one or more second VADs are combined in dependence on input signal properties.
9. The method according to claim 8, wherein the input signal properties comprise at least one of estimated signal-to-noise-ratio and background characteristics.
10. A first voice activity detector, VAD, configured to detect voice activity in a received input signal, the first VAD comprising: an input circuit configured to: receive a signal from a primary voice detector of said first VAD indicative of a primary voice activity decision regarding voice activity in said input signal, wherein the primary voice activity decision is an intermediate voice activity decision of said first VAD in the sense that the primary voice activity decision is made by the first VAD without having been processed by a hangover addition unit of said first VAD, and receive one or more signals from one or more second VADs external to the first VAD each indicative of a voice activity decision made by a respective second VAD regarding voice activity in said input signal, each second VAD comprising its own primary voice detector and hangover addition unit distinct from that of said first VAD, a processor circuit configured to combine the voice activity decisions indicated in the signal received from the primary voice detector of said first VAD and the one or more signals received from the one or more second VADs to generate a modified primary voice activity decision, and an output circuit configured to send the modified primary voice activity decision to a hangover addition unit of said first VAD that is configured to make a final voice activity decision of said first VAD.
 11. The first VAD according to claim 10, wherein the processor circuit is configured to combine the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs by a logical AND, the modified primary voice activity decision thereby indicating voice only if the signal from the primary voice detector and each signal from the one or more second VADs indicate voice.
 12. The first VAD according to claim 10, wherein the processor circuit is configured to combine the voice activity decisions in the signals received from the primary voice detector and the one or more second VADs by a logical OR, the modified primary voice activity decision thereby indicating voice if at least one signal of the signal from the primary voice detector and the one or more signals from the one or more second VADs indicate voice.
 13. The first VAD according to claim 10, wherein at least one signal from a second VAD is a final voice activity decision made by that second VAD in the sense that the final voice activity decision is made by the second VAD after having been processed by the hangover addition unit of said second VAD.
 14. The first VAD according to claim 10, wherein at least one signal from a second VAD is a primary voice activity decision made by a primary voice detector of that second VAD, the primary voice activity decision being an intermediate voice activity decision of the second VAD in the sense that the primary voice activity decision is made by the second VAD without having been processed by the hangover addition unit of said second VAD.
 15. The first VAD according to claim 10, wherein the input circuit is configured to receive only one signal from one of said second VADs.
 16. The first VAD according to claim 10, wherein the input circuit is configured to receive a plurality of signals from a plurality of said second VADs.
 17. The first VAD according to claim 10, wherein the voice activity decisions indicated in the signals received from the primary voice detector and the one or more second VADs are combined in dependence on input signal properties.
 18. The first VAD according to claim 17, wherein the input signal properties comprise at least one of estimated signal-to-noise-ratio and background characteristics.
 19. The method according to claim 1, wherein at least one of the one or more second VADs is configured to generate lower activity or introduce less speech clipping than the first VAD under certain input conditions comprising one or more of a certain noise level, a certain signal-to-noise ratio, and a certain noise characteristic.
 20. The method according to claim 1, wherein, under certain input conditions, the primary voice activity decision from the first VAD's primary voice detector falsely indicates voice activity or clips speech, and wherein said combining is performed using combination logic that is adapted to said certain input conditions such that the one or more decisions from the one or more second VADs only modify the primary voice activity decision of the first VAD's primary voice detector under said certain input conditions, wherein said certain input conditions comprise at least one of a certain noise level, a certain signal-to-noise ratio, and a certain noise characteristic.
 21. The method according to claim 1, wherein said combining comprises combining the primary voice activity decision made by the primary voice detector of said first VAD, a primary voice activity decision made by the primary voice detector of a given one of the one or more second VADs, and a final voice activity decision output by the hangover addition unit of said given one of the one or more second VADs.
 22. The method according to claim 1, wherein said combining comprises combining the primary voice activity decision made by the primary voice detector of said first VAD and a primary voice activity decision made by the primary voice detector of one of the one or more second VADs using a first combination logic, and combining the result with a final voice activity decision output by the hangover addition unit of one of the one or more second VADs using a second combination logic different from the first combination logic.
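
As an informal illustration only, and not forming part of the claims, the combination logic recited in claims 11, 12 and 22 may be sketched in Python as follows. All function and parameter names are hypothetical. The choice of a logical OR for the first stage and a logical AND for the second stage of the two-stage combination is an assumption made for the example; claim 22 only requires that the two combination logics differ.

```python
from typing import Iterable


def combine_and(primary: bool, external: Iterable[bool]) -> bool:
    """Logical AND (cf. claim 11): voice only if every decision indicates voice."""
    return primary and all(external)


def combine_or(primary: bool, external: Iterable[bool]) -> bool:
    """Logical OR (cf. claim 12): voice if at least one decision indicates voice."""
    return primary or any(external)


def combine_two_stage(primary: bool, ext_primary: bool, ext_final: bool) -> bool:
    """Two different combination logics applied in sequence (cf. claim 22).

    Assumption for this sketch: OR in the first stage, AND in the second.
    primary     -- primary decision of the first VAD (before hangover addition)
    ext_primary -- primary decision of a second (external) VAD
    ext_final   -- final decision of a second VAD (after its hangover addition)
    """
    first_stage = combine_or(primary, [ext_primary])
    return combine_and(first_stage, [ext_final])


if __name__ == "__main__":
    # Per-frame example: the first VAD misses speech that the external VAD detects.
    print(combine_two_stage(primary=False, ext_primary=True, ext_final=True))
```

In such a sketch, the returned value plays the role of the modified primary voice activity decision, which would then be passed to the hangover addition unit of the first VAD to produce the final decision for the frame.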